Compound classification based on SMILES representation

qHTS for Inhibitors of human tyrosyl-DNA phosphodiesterase 1 (TDP1): qHTS in cells in absence of CPT

Stages 1 and 2

Carla Rafaela Silva, pg42862; José Pereira, pg42871; Tiago Silva, pg42885.

Introduction

Human tyrosyl-DNA phosphodiesterase 1 (TDP1) is a novel repair gene, and we propose to use it as a new target for anticancer drug development. TDP1 is not an essential protein, but under treatment with a topoisomerase I poison (camptothecin: CPT), TDP1 works as a critical factor for cell survival. To directly identify novel TDP1 inhibitors active in a cellular environment, we have knocked out the Tdp1 gene in chicken DT40 cells (Tdp1-/-) and generated a complemented counterpart cell line that contains a stable transfection of the human TDP1 gene (Tdp1-/-;hTDP1 cells). For the primary screen, Tdp1-/-;hTDP1 cells will be exposed to small molecules in the presence or absence of CPT, and their growth kinetics will be evaluated after 48 hours by measuring ATP activity. If a given compound shows a synergistic effect with CPT, this compound could inhibit the repair pathway of CPT-induced lesions, including the TDP1-mediated repair pathway. The hit compounds will then be evaluated in the presence or absence of CPT using Tdp1-/- cells. If a compound shows a synergistic effect with CPT in Tdp1-/-;hTDP1 cells, but not in Tdp1-/- cells, such a compound could be involved in the inhibition of the TDP1-mediated repair pathway. In tertiary assays, biochemical gel-based assays will be used to assess whether the hit compounds specifically target TDP1.

Imports

Initial exploration

Import dataset

The first step in analysing this dataset is to load and display the TDP1 data.

Simple Analyses

This step examines how the data is distributed across the rows and columns of the dataset.

This dataset was loaded under the name 'dataset'. It has 40,000 distinct molecules and 48 variables. In total, there are 1,920,000 data entries.
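The loading and size check can be sketched as follows, assuming the assay export is a CSV file read with pandas. The inline table `toy_csv` and its values are hypothetical stand-ins, since the path of the real file is not given here.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real assay export (path not given in this report).
toy_csv = io.StringIO(
    "PUBCHEM_SID,PUBCHEM_ACTIVITY_OUTCOME,smiles\n"
    "1,1,CCO\n"
    "2,2,c1ccccc1\n"
    "3,1,CC(=O)O\n"
)
dataset = pd.read_csv(toy_csv)

n_rows, n_cols = dataset.shape  # molecules x variables
n_entries = dataset.size        # rows * columns
print(n_rows, n_cols, n_entries)
```

Applied to the full dataset, the same calculation gives 40,000 x 48 = 1,920,000 entries.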

Column name: Description
PUBCHEM_RESULT_TAG: This column contains an increasing number starting from one.
PUBCHEM_SID: PubChem SubstanceID
PUBCHEM_CID: PubChem CompoundID
PUBCHEM_ACTIVITY_OUTCOME: This field allows the submitter to make an expert judgment call about the activity of each test result. Using a number, the value is set to 1 (inactive) or 2 (active) based on whatever means appropriate. In addition to active/inactive, this field can also be set to 3 (inconclusive), 4 (unspecified) or 5 (probe). The 'probe' designation indicates that the activity of the test result has been tested and confirmed through multiple rounds of experimental inquiry.
PUBCHEM_ACTIVITY_SCORE: The activity of a test result may be assigned a normalized score between 0 and 100, where the most active result rows have scores closer to 100 and inactive ones closer to 0, so that one can rank the results based on this data and prioritize hits.
PUBCHEM_ACTIVITY_URL: A URL may optionally be provided in this column for the Assay Data reported for this Substance.
PUBCHEM_ASSAYDATA_COMMENT: Textual annotation and comments
Potency: Concentration at which the compound exhibits half-maximal efficacy
Efficacy: Maximal efficacy of the compound, reported as a percentage of control
Analysis Comment: Annotation/notes on a particular compound's data or its analysis
Activity_Score: Activity score
Curve_Description: A description of dose-response curve quality
Fit_LogAC50: The logarithm of the AC50 from a fit of the data to the Hill equation (calculated based on molar units)
Fit_HillSlope: The Hill slope from a fit of the data to the Hill equation
Fit_R2: R^2 fit value of the curve. Closer to 1.0 equates to a better Hill equation fit
Fit_InfiniteActivity: The asymptotic efficacy from a fit of the data to the Hill equation
Fit_ZeroActivity: Efficacy at zero concentration of compound from a fit of the data to the Hill equation
Fit_CurveClass: Numerical encoding of the curve description for the fitted Hill equation
Excluded_Points: Which dose-response titration points were excluded from analysis based on outlier analysis
Max_Response: Maximum activity observed for the compound (usually at the highest concentration tested)
Activity at xx uM*: % activity at the given concentration
Compound QC: NCGC designation for data stage: 'qHTS', 'qHTS Verification', 'Secondary Profiling'
smiles: SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that allows a user to represent a chemical structure in a way that can be used by the computer.

*Activity at xx uM refers to all columns that show the activity of a molecule at a given concentration.

Pre-Processing

The number of missing values (NAs) in each column is counted.

Visualization of the NAs

As we can see, a few columns consist entirely of NAs, such as "PUBCHEM_ASSAYDATA_COMMENT" and "Analysis Comment". These columns therefore add no information to the dataset. It is important to note that there are 10 molecules with a missing SMILES string.

We can observe that more than 50% of all data entries are NAs.
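The counts discussed above could be obtained along these lines. This is a minimal sketch on a hypothetical toy frame, assuming the data is held in a pandas DataFrame; only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real dataset has 48 columns.
dataset = pd.DataFrame({
    "PUBCHEM_ASSAYDATA_COMMENT": [np.nan, np.nan, np.nan, np.nan],
    "Potency": [1.2, np.nan, 3.4, np.nan],
    "smiles": ["CCO", "c1ccccc1", None, "CC(=O)O"],
})

na_per_column = dataset.isna().sum()  # missing values per column
all_na_columns = na_per_column[na_per_column == len(dataset)].index.tolist()
na_fraction = dataset.isna().sum().sum() / dataset.size  # share of all entries
missing_smiles = int(dataset["smiles"].isna().sum())

print(na_per_column)
print(all_na_columns, round(na_fraction, 3), missing_smiles)
```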

Drop specific features

Three columns consisting only of NAs were removed, which reduced the dataset to 45 columns. Columns whose information is not useful for further analysis were also removed, namely "PUBCHEM_ACTIVITY_URL" and "Compound QC", bringing the column total to 43. The 10 molecules without a SMILES string were removed from the dataset as well.
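A sketch of these three removal steps with pandas, run on a hypothetical toy frame. The exact calls used in the original analysis are not shown in this report, so the chain below is one possible way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the real dataset.
dataset = pd.DataFrame({
    "PUBCHEM_ASSAYDATA_COMMENT": [np.nan, np.nan, np.nan],
    "PUBCHEM_ACTIVITY_URL": ["u1", "u2", "u3"],
    "Compound QC": ["qHTS", "qHTS", "qHTS"],
    "Potency": [1.2, np.nan, 3.4],
    "smiles": ["CCO", None, "CC(=O)O"],
})

cleaned = (
    dataset
    .dropna(axis=1, how="all")                              # columns that are entirely NA
    .drop(columns=["PUBCHEM_ACTIVITY_URL", "Compound QC"])  # columns judged uninformative
    .dropna(subset=["smiles"])                              # molecules without a SMILES string
    .reset_index(drop=True)
)
print(cleaned.shape)
```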

To help with future analysis, the "PUBCHEM_ACTIVITY_OUTCOME" categorical variable was transformed into a binary variable.
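One way this transformation might look is shown below. The encoding (active as 1, everything else as 0) and the new column name `activity` are assumptions, since the report does not state them.

```python
import pandas as pd

# Hypothetical outcome codes: 1 = inactive, 2 = active (see the column description).
dataset = pd.DataFrame({"PUBCHEM_ACTIVITY_OUTCOME": [1, 2, 2, 1, 2]})

# Assumed encoding: active -> 1, anything else -> 0.
dataset["activity"] = (dataset["PUBCHEM_ACTIVITY_OUTCOME"] == 2).astype(int)
print(dataset["activity"].tolist())
```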

Graphic Exploration

Activity_outcome and Phenotype

As the "PUBCHEM_ACTIVITY_OUTCOME" pie chart shows, the data is balanced for binary classification. In the "Phenotype" pie chart, the classes are very imbalanced overall. However, the 'Inactive' and 'Inhibitor' phenotypes are balanced with respect to each other.
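The proportions behind such pie charts can be checked numerically. The sketch below uses hypothetical labels, a hypothetical `activity` column for the binary outcome, and a placeholder 'Other' phenotype standing in for the minority classes.

```python
import pandas as pd

# Hypothetical labels; the real columns hold one value per molecule.
dataset = pd.DataFrame({
    "activity": [0, 1, 0, 1, 1, 0],
    "Phenotype": ["Inactive", "Inhibitor", "Inactive", "Inhibitor", "Other", "Inactive"],
})

binary_share = dataset["activity"].value_counts(normalize=True)      # class proportions
phenotype_share = dataset["Phenotype"].value_counts(normalize=True)  # phenotype proportions
print(binary_share)
print(phenotype_share)
```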

Boxplots of Activity at 0.00299 uM, 0.363 uM, 1.849 uM, 9.037 uM and 46.23 uM

Compound Standardization

The molecules are standardized by removing isotope information, neutralizing charges, removing stereochemistry and removing smaller fragments. This way, different forms of the same parent structure (salts, isotopologues, stereoisomers) map to a single representation.
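The four standardization steps could be sketched with RDKit as follows. The helper name `standardize_smiles`, the order of the steps and the example input are assumptions; the original pipeline may differ.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles):
    """Return a standardized canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # drop smaller fragments
    mol = rdMolStandardize.Uncharger().uncharge(mol)             # neutralize charges
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)                                       # remove isotope labels
    Chem.RemoveStereochemistry(mol)                              # remove stereo (in place)
    return Chem.MolToSmiles(mol)


# Hypothetical input: sodium salt of a 13C-labelled, chiral carboxylate.
print(standardize_smiles("[13CH3][C@H](N)C(=O)[O-].[Na+]"))
```

Returning None for unparsable input lets the caller decide whether to drop or inspect such molecules.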

Feature Generation

This step is divided into two parts: molecular descriptors and molecular fingerprints. Molecular descriptors are an abstract representation of certain structural features of a molecule. These descriptors may represent a structural key within a molecule. This might be as simple as a count of a particular atom type, the presence of a particular ring system, or a calculated property. Molecular fingerprints are more abstract than a structural key but have the advantage of being more general, since they do not represent pre-defined patterns.

Molecular Descriptors

Create dataframe with feature names

As we can see, after generating the molecular descriptors, we ended up with 208 features.

We selected four of these descriptors to examine their distributions in more detail. ExactMolWt is the exact (monoisotopic) molecular weight of the molecule. NumAromaticRings counts the aromatic rings. RingCount counts all rings. TPSA, the topological polar surface area, is the polar surface area of the molecule estimated from its topology.
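A minimal sketch of descriptor generation with RDKit, assuming the descriptors come from `rdkit.Chem.Descriptors`. The example molecules are hypothetical, and the number of descriptors returned depends on the installed RDKit version.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles_list = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # hypothetical example molecules
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

# Descriptors._descList holds (name, function) pairs; its length depends on the RDKit version.
descriptor_table = pd.DataFrame(
    [{name: fn(mol) for name, fn in Descriptors._descList} for mol in mols],
    index=smiles_list,
)
print(descriptor_table.shape)
print(descriptor_table[["ExactMolWt", "NumAromaticRings", "RingCount", "TPSA"]])
```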

Comparing both box plots, we can observe that the median molecular weight is slightly higher for the active molecules.

Comparing both box plots, we can observe that the median ring count is slightly higher for the active molecules.

Comparing both box plots, we can observe that the median number of aromatic rings is slightly higher for the active molecules.

Unlike the previous results, the active molecules have a slightly lower median topological polar surface area than the inactive ones.
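The medians read off the box plots can also be computed directly. The sketch below uses hypothetical values and a hypothetical binary `activity` column (0 = inactive, 1 = active).

```python
import pandas as pd

# Hypothetical descriptor values and binary activity labels.
df = pd.DataFrame({
    "activity":   [0, 0, 0, 1, 1, 1],
    "ExactMolWt": [250.1, 300.2, 280.0, 310.3, 290.5, 350.7],
    "TPSA":       [80.0, 95.5, 70.2, 60.1, 75.3, 65.0],
})

# Median of each descriptor per activity class.
medians = df.groupby("activity")[["ExactMolWt", "TPSA"]].median()
print(medians)
```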

Normalize Data

Molecular Fingerprints

We study three different ways of constructing fingerprints: MorganFingerprint, RDKFingerprint and MACCSkeysFingerprint.

Both Morgan and RDK fingerprint techniques produced 2048 features while MACCSkeys produced only 167 features.
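The three fingerprint types and their sizes can be illustrated with RDKit as below. The Morgan radius of 2 and the example molecule are assumptions; the report only states the resulting number of features.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # hypothetical example molecule

# Radius 2 is an assumed setting; fpSize matches the 2048 features reported above.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
morgan_fp = morgan_gen.GetFingerprint(mol)
rdk_fp = Chem.RDKFingerprint(mol)        # default size is 2048 bits
maccs_fp = MACCSkeys.GenMACCSKeys(mol)   # fixed size of 167 bits

print(morgan_fp.GetNumBits(), rdk_fp.GetNumBits(), maccs_fp.GetNumBits())
```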

Feature Selection

Variance measures the spread of a variable's values around their mean. A feature with very low variance is nearly constant across molecules and therefore contributes little to distinguishing the response classes. To select the most informative features, we applied the Boruta algorithm to the molecular descriptors and kept the top-ranking 10% of the features of each molecular fingerprint.

Molecular Descriptors

After feature selection, the number of features dropped by more than half, from 208 to 82. ExactMolWt, NumAromaticRings, RingCount and TPSA were retained after feature selection.

Molecular Fingerprints

After feature selection, the number of features was reduced from 2048 to 205 for the Morgan and RDK fingerprints. The MACCSkeysFingerprint was reduced to 167 features.
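Keeping the top 10% of fingerprint features could be sketched with scikit-learn's `SelectPercentile`. The chi-squared scoring function and the random toy data are assumptions; the report does not state how the features were ranked.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))   # hypothetical binary fingerprint matrix
y = rng.integers(0, 2, size=200)         # hypothetical active/inactive labels

# chi2 is an assumed scoring function; the report does not state which one was used.
selector = SelectPercentile(score_func=chi2, percentile=10)
X_selected = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)  # indices of the retained bits
print(X.shape, "->", X_selected.shape)
```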

Unsupervised exploration

Principal Component Analysis (PCA) is a dimension-reduction tool and a statistical procedure that can be used to reduce a large set of variables to a small set that still contains most of the information of the large set. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PC). This procedure can explain the variance-covariance structure of the data.
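The two quantities reported for each feature set below (variance explained by the first two components, and the number of components needed to reach 95%) can be computed as in this sketch. The synthetic data and the standard scaling step are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
# Hypothetical feature matrix: 10 correlated features built from 3 latent factors plus noise.
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

X_scaled = StandardScaler().fit_transform(X)  # scaling is an assumed preprocessing step
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
first_two = cumulative[1]                              # variance explained by PC1 + PC2
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)  # smallest k reaching 95%
print(round(first_two, 3), n_for_95)
```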

t-distributed Stochastic Neighbor Embedding (t-SNE) is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
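A two-dimensional t-SNE embedding can be obtained with scikit-learn as sketched below. The synthetic data, the perplexity of 10 and the PCA initialization are assumed settings, not taken from this report.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: two groups of 30 points in 20 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 20)),
               rng.normal(5.0, 1.0, size=(30, 20))])

# perplexity must be smaller than the number of samples; 10 is an assumed value.
embedding = TSNE(n_components=2, perplexity=10, init="pca", random_state=0).fit_transform(X)
print(embedding.shape)
```

Each row of `embedding` holds the two coordinates that would be plotted and coloured by activity.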

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid), serving as a prototype of the cluster.
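As an illustration, k-means with two clusters can be run with scikit-learn as follows. The choice k = 2 (matching the active/inactive split) and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated groups of 50 points in 2 dimensions.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_              # cluster index of each observation
centroids = kmeans.cluster_centers_  # mean of each cluster
print(np.bincount(labels), centroids.round(2))
```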

Descriptors

Principal Component Analysis (PCA)

The first two principal components explain 45% of the data variance. To explain 95% of the variance, 29 principal components are required.

The PCA graph shows that our data does not separate well along the first principal component, which explains 34% of the variance. There is a separation along the second principal component, which explains 10% of the variance. In general, it is difficult to distinguish between active and inactive molecules.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe a small separation of the data along the first dimension.

k-Means

According to the k-Means graph, there is no clear separation between the clusters.

Fingerprints

MorganFingerprint

Principal Component Analysis (PCA)

The first two principal components explain 44% of the data variance. To explain 95% of the variance, 34 principal components are required.

The PCA graph shows that our data does not separate well along the first two principal components, which together explain 44% of the variance. In general, it is difficult to distinguish between active and inactive molecules.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe no clear separation of the data along either dimension.

k-Means

According to the k-Means graph, there is no clear separation between the clusters.

RDKFingerprint

Principal Component Analysis (PCA)

The first two principal components explain 88% of the data variance. To explain 95% of the variance, 12 principal components are required.

The PCA graph shows that our data has a small separation along the first principal component, which explains 83% of the variance. There is also a small separation along the second principal component, which explains 4% of the variance. Even though the first two principal components explain 88% of the variance, it is still difficult to distinguish the molecules according to their activity.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe that there is a small separation along the first dimension.

k-Means

According to the k-Means graph, there is a clear separation between the clusters.

MACCSkeysFingerprint

Principal Component Analysis (PCA)

The first two principal components explain 64% of the data variance. To explain 95% of the variance, 9 principal components are required.

The PCA graph shows that our data does not separate well along the first two principal components, which together explain 64% of the variance. In general, it is difficult to distinguish between active and inactive molecules.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe that there is no separation of the data along the two dimensions.

k-Means

According to the k-Means graph, there is no clear separation between the clusters.

Conclusion

From the TDP1 activity dataset, we extracted the SMILES string and the activity of every molecule. From the SMILES strings we obtained two types of features: descriptors and fingerprints. These were related to the activity state of the corresponding molecule and examined using PCA, t-SNE and k-means clustering. However, these analyses were inconclusive, so it remains difficult to distinguish the molecules according to their activity state. Nonetheless, we think it is possible to proceed to supervised learning using the descriptors and the RDKFingerprint technique, which achieved the best results among the fingerprints. We expect supervised learning to give better results in classifying the molecules' activity state.